title: "Data Scientists' Salary"¶
author: 'Gamaliel Mendoza'
date: 26/May/25
This notebook analyzes data scientists' salaries as of 2024.¶
- The analysis is based on:
    - Records from 2020-2022 from Hugging Face
    - Data for 2024 from Kaggle
- The analysis procedure was:
    - Queries with SQLite3
    - Conversion to DataFrames
    - Statistical analysis:
        - One-way analysis of variance (for the 2020-2022 dataset)
        - Linear regression (for the 2024 dataset)
sol = 0
jpl = 0
download = bool(sol)
if download:
    print('Downloading')
    if jpl == 0:
        !pip install kagglehub huggingface_hub ipython-sql scipy
        print('Downloaded with pip')
    elif jpl == 1:
        softs = ["kagglehub", "seaborn", "huggingface_hub", "ipython-sql", "scipy"]
        print('Downloading with piplite')
        import piplite
        for s in softs:
            print(s)
            await piplite.install(s)
            print(s, 'installed')
        print('Done')
else:
    print('Libraries already downloaded')
Libraries already downloaded
#await piplite.install("pretty_jupyter")
import kagglehub
import warnings
import pandas as pd
import prettytable
prettytable.DEFAULT = 'DEFAULT'
import sqlite3
warnings.simplefilter(action='ignore', category=FutureWarning)
# Download the latest version of each dataset
import os
from matplotlib import pyplot as plt
# The two datasets are downloaded with API resources
path = kagglehub.dataset_download("adilshamim8/salary-prediction-salary-dataset")
fls=os.listdir(path)
df = pd.read_csv(os.path.join(path, fls[0]))
dfTemp = pd.read_csv("hf://datasets/hugginglearners/data-science-job-salaries/ds_salaries.csv")
# Create the SQLite connection and load both DataFrames as tables
%reload_ext sql
con=sqlite3.connect('KaggleHF_DS.db')
cursor=con.cursor()
%sql sqlite:///KaggleHF_DS.db
df.to_sql("salary_prediction_updated_datas",con,if_exists='replace', index=False)
dfTemp.to_sql("ds_salaries",con,if_exists='replace', index=True)
607
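As a sanity check on the load step, `sqlite_master` can confirm that both `to_sql` calls created the expected tables. A minimal, self-contained sketch of the pattern, using an in-memory database with made-up rows instead of `KaggleHF_DS.db`:

```python
import sqlite3
import pandas as pd

# Toy version of the load-and-verify pattern, using an in-memory
# database instead of KaggleHF_DS.db (values are made up).
con = sqlite3.connect(':memory:')
pd.DataFrame({'salary_in_usd': [79833, 260000]}).to_sql('ds_salaries', con, index=False)
pd.DataFrame({'Salary': [6488.44]}).to_sql('salary_prediction_updated_datas', con, index=False)

# sqlite_master lists every table actually present in the database
tables = sorted(row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)  # ['ds_salaries', 'salary_prediction_updated_datas']
```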
Verify data presentation and manageability¶
# Query the Hugging Face table to check it loaded
%sql Select distinct * from ds_salaries limit 5;
* sqlite:///KaggleHF_DS.db Done.
| index | Unnamed: 0 | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020 | MI | FT | Data Scientist | 70000 | EUR | 79833 | DE | 0 | DE | L |
| 1 | 1 | 2020 | SE | FT | Machine Learning Scientist | 260000 | USD | 260000 | JP | 0 | JP | S |
| 2 | 2 | 2020 | SE | FT | Big Data Engineer | 85000 | GBP | 109024 | GB | 50 | GB | M |
| 3 | 3 | 2020 | MI | FT | Product Data Analyst | 20000 | USD | 20000 | HN | 0 | HN | S |
| 4 | 4 | 2020 | SE | FT | Machine Learning Engineer | 150000 | USD | 150000 | US | 50 | US | L |
# Query the Kaggle table to check it loaded
%sql Select distinct * from salary_prediction_updated_datas limit 5;
* sqlite:///KaggleHF_DS.db Done.
| YearsExperience | Education | Job | Salary |
|---|---|---|---|
| 14.98 | PhD | Marketing Specialist | 6488.436666666667 |
| 38.03 | PhD | Software Engineer | 13906.698333333334 |
| 29.28 | Bachelor's | Business Analyst | 10297.772500000001 |
| 23.95 | Master's | Marketing Specialist | 7512.923333333333 |
| 6.24 | Master's | Business Analyst | 4660.950833333333 |
General overview (2020-2022)¶
import seaborn as sns
yrs=pd.read_sql_query("select work_year from ds_salaries where job_title like 'Data Scientist'",con)
AllDS=pd.read_sql_query("select * from ds_salaries where job_title like 'Data Scientist'",con)
AllCs=AllDS.columns
Years=pd.unique(yrs['work_year'])
bxs = pd.DataFrame()
for y, yr in enumerate(Years):
    bxyr = pd.read_sql_query(f"select salary_in_usd from ds_salaries where job_title like 'Data Scientist' and work_year = {yr}", con)
    bxs.loc[0:len(bxyr['salary_in_usd']), f'{yr}'] = bxyr.loc[:, 'salary_in_usd'] / 12
boxes = pd.DataFrame(bxs)
# Select the columns used to overlay each observation
import operator
places = [2, 8, 9]
getter = operator.itemgetter(*places)
cols = list(getter(AllCs))
dbsns = AllDS[cols]
fig, axes = plt.subplots()
ax = sns.violinplot(data=bxs)
# Fade the violins so the overlaid boxplots stay readable
for violin in ax.collections:
    violin.set_alpha(0.25)
sns.boxplot(data=dbsns, x=dbsns['work_year'], y=dbsns['salary_in_usd'] / 12, hue='employee_residence')
sns.despine(offset=10, trim=True)
plt.ylabel('Salary USD')
plt.xlabel('Year');
# Improve the legend
sns.move_legend(
    axes, loc="upper right", ncol=6, frameon=False, columnspacing=1, handletextpad=0,
)
Historically, Data Scientists earned:¶
DSt=bxs.describe()
samples=DSt[DSt.columns].loc['count']
means=DSt[DSt.columns].loc['mean']
DSt
But, is it different?¶
from scipy.stats import f_oneway
# Drop the NaN padding first: each year has a different number of records,
# and f_oneway returns nan if any group contains NaN
groups = [bxs[c].dropna() for c in bxs.columns]
f, p = f_oneway(*groups)
print(f"A one-way ANOVA across the analyzed years {'is significant' if p < 0.05 else 'is not significant'} (p = {round(p, 4)})")
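To see how `f_oneway` behaves, here is a self-contained check on synthetic groups (all values are made up): two groups drawn from the same distribution plus one clearly shifted group should yield a significant F-test.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
g2020 = rng.normal(100, 10, 50)  # synthetic "monthly salary" samples
g2021 = rng.normal(100, 10, 50)  # same distribution as g2020
g2022 = rng.normal(130, 10, 50)  # clearly shifted group

f, p = f_oneway(g2020, g2021, g2022)
print(f'F={f:.1f}, p={p:.2e}')
# With one group shifted by three standard deviations, p falls far below 0.05
```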
And, as they say, "practice makes perfect."¶
%sql alter table salary_prediction_updated_datas rename column 'Job Role' to Job;
%sql alter table salary_prediction_updated_datas rename column 'Education Level' to Education;
%sql update salary_prediction_updated_datas set Salary = Salary / 12.0;
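SQLite supports `ALTER TABLE ... RENAME COLUMN` from version 3.25 onward. A quick self-contained check of the quoted-column rename, using a throwaway in-memory table rather than the real one:

```python
import sqlite3

# Throwaway in-memory table mimicking the quoted-column renames above
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t ("Job Role" TEXT, "Education Level" TEXT)')
con.execute('ALTER TABLE t RENAME COLUMN "Job Role" TO Job')
con.execute('ALTER TABLE t RENAME COLUMN "Education Level" TO Education')

# PRAGMA table_info reports the current column names (index 1 of each row)
cols = [row[1] for row in con.execute('PRAGMA table_info(t)')]
print(cols)  # ['Job', 'Education']
```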
As of 2024, data scientists come from different degrees:¶
careers = pd.read_sql_query("select distinct Education from salary_prediction_updated_datas", con)
levs = []
for career in careers['Education']:
    vs = pd.read_sql_query(f"select * from salary_prediction_updated_datas where Education like '%{career[0:3]}%' AND Job = 'Data Scientist' limit 1", con)
    levs.append({"Years of Experience": vs.loc[0, "YearsExperience"],
                 "Degree": vs.loc[0, "Education"],
                 "Job Title": vs.loc[0, "Job"],
                 "Average monthly salary": round(vs.loc[0, "Salary"], 2)})
T = pd.DataFrame(levs)
print(T)
# We need to extract the data from the tables.
# One way to do so is with SQL queries converted to DataFrames, as shown below.
phds=pd.read_sql_query("select * from salary_prediction_updated_datas where Education like 'PhD' AND Job = 'Data Scientist'",con);
bach=pd.read_sql_query("select * from salary_prediction_updated_datas where Education like '%Bachelor%' AND Job LIKE \'%Data%\'",con);
mast=pd.read_sql_query("select * from salary_prediction_updated_datas where Education like '%Master%' AND Job LIKE '%Data%'",con);
allDS=pd.read_sql_query("select * from salary_prediction_updated_datas where Job = 'Data Scientist'",con);
allDS2 = cursor.execute("select * from salary_prediction_updated_datas where Job LIKE '%Data%'")
allDS2 = pd.DataFrame(cursor.fetchall())
# rename returns a new DataFrame, so the result must be assigned back
allDS2 = allDS2.rename(columns={0: "Index", 1: "YearsExperience", 2: "Education", 3: "Job", 4: "Salary"})
DSs = df[df['Job Role'] == "Data Scientist"].copy()
DSs.loc[:, 'Salary'] = DSs.loc[:, 'Salary'] / 12
Baches=DSs[DSs['Education Level']=="Bachelor's"]
Masters=DSs[DSs['Education Level']=="Master's"]
Docs=DSs[DSs['Education Level']=="PhD"]
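The per-degree subsets built above can also be summarized in one step with `groupby`. A sketch on toy data, since the Kaggle file itself is not bundled here (all rows are made up):

```python
import pandas as pd

# Toy stand-in for the Kaggle table: same columns, made-up rows
df = pd.DataFrame({
    'Job Role': ['Data Scientist'] * 4 + ['Business Analyst'],
    'Education Level': ["Bachelor's", "Master's", "PhD", "PhD", "Master's"],
    'Salary': [48000.0, 60000.0, 72000.0, 78000.0, 55000.0],
})
DSs = df[df['Job Role'] == 'Data Scientist'].copy()
DSs['Salary'] = DSs['Salary'] / 12  # annual -> monthly

# Mean monthly salary per education level in one step
summary = DSs.groupby('Education Level')['Salary'].mean()
print(summary)
```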
Data scientists' income is linear with respect to experience¶
# Import the libraries used for the regression
import seaborn as sb
import numpy as np
import scipy
print('Libraries imported')
# Plot Salary vs. years of experience as a joint plot
dtpt = DSs
j = sb.jointplot(data=dtpt, x='YearsExperience', y='Salary', ratio=10, hue='Education Level', marker='+', marginal_ticks=True)
x = np.array(dtpt['YearsExperience'])
y = np.array(dtpt['Salary'])
# Keep only rows where both values are present, so (x, y) pairs stay aligned
mask = ~(np.isnan(x) | np.isnan(y))
X = x[mask]
Y = y[mask]
# Correlation coefficient for the data
r, p = scipy.stats.pearsonr(X, Y)
# Obtain the slope and intercept as a measure of initial salary from (x, y)
def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # means of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)
intercept, slope = estimate_coef(X, Y)
plt.suptitle('Data Scientists\' Salary');
print('Linear regression: y=mx+b; y={:3.2f}x+{:3.2f}, r={:.3}, p={:.4e}'.format(slope, intercept, r, p))
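The closed-form least-squares formulas in `estimate_coef` can be cross-checked against `np.polyfit` on synthetic data with a known slope and intercept. This is just a sketch; the helper is repeated so the block runs on its own.

```python
import numpy as np

def estimate_coef(x, y):
    # same closed-form least-squares formulas as in the notebook
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

# Noise-free line y = 3x + 5: both methods should recover slope 3, intercept 5
x = np.arange(10, dtype=float)
y = 3.0 * x + 5.0
b0, b1 = estimate_coef(x, y)
m, b = np.polyfit(x, y, 1)
print(round(b1, 6), round(b0, 6))  # 3.0 5.0
```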
And they come from different backgrounds:¶
numofClasses=pd.read_sql_query("select Education, count(*) as freq from salary_prediction_updated_datas where Job = 'Data Scientist' group by Education",con);
numofClasses
But everyone has the same chance.¶
TablaDeg = []
for degree in numofClasses['Education']:
    base = DSs[DSs['Education Level'] == degree]
    x = np.array(base['YearsExperience'])
    y = np.array(base['Salary'])
    # Keep only rows where both values are present, so pairs stay aligned
    mask = ~(np.isnan(x) | np.isnan(y))
    X = x[mask]
    Y = y[mask]
    r, p = scipy.stats.pearsonr(X, Y)
    intercept, slope = estimate_coef(X, Y)
    TablaDeg.append({"Degree": degree, "n=": X.shape[0], "Raise": slope,
                     "Initial": intercept, "r": r, "P-value": round(p, 4)})
Tabla = pd.DataFrame(TablaDeg)
print(Tabla)
fig, ax=plt.subplots()
ini=ax.plot(Tabla['Degree'],Tabla['Initial'], color='gray',label='Initial')#,
plt.ylabel('Initial salary (monthly)');
plt.xticks(rotation=15);
slp=plt.twinx()
rs=slp.plot(Tabla['Degree'],Tabla['Raise'],color='Blue',label='Raise')
fig.suptitle('Salary per education level');
plt.ylabel('Raise/year');
Take-home message¶
- Since 2020, Data Scientists have been paid better year after year.
    - How might this relate to the different sectors of today's highly technological world?
- The education level at which data scientists start their career is not as important as one might think:
    - Everyone starts at roughly the same point; some earn ~$30 less than the others at the beginning.
    - And they increase their salary at about the same rate.
try:
    !jupyter nbconvert dsp001.ipynb --to html --template pj
except Exception as e:
    print('HTML not stored')